Method of displaying gene data, and recording medium

ABSTRACT

A method of displaying gene data assists in discovering expression patterns specific to gene functions and inferring the function of a gene of which the function is unknown. A threshold value representing the degree of similarity of expression patterns is established in advance, and genes having the same function and genes with similar expression patterns to them are extracted and displayed. Cluster analysis is performed on the extracted genes by re-selecting the experiment patterns for clustering. Calculations are performed to see how many functions the genes belonging to a subtree have, and the proportion of each function in the subtree is determined. If a proportion exceeds a predetermined threshold value, they are regarded as a cluster (collection of genes with similar functions) and extracted.

This application is the National Phase of International Application PCT/JP00/06385 filed Sep. 19, 2000 which designated the U.S. and that International Application was not published under PCT Article 21(2) in English.

TECHNICAL FIELD

The present invention relates to a method of displaying gene expression data obtained as a result of hybridisation with a specified gene in a manner that is visually easy to understand and which allows the function and role of the gene to be easily conjectured.

BACKGROUND ART

As genome sequences are determined for an increasing variety of species, a great deal of attention is being paid to a so-called genome comparison method aimed at discovering new information from genetic differences between them. The genome comparison method aims to find out genes responsible for the development of individual species, in order to look for groups of genes which are believed to be common to all living organisms, or, conversely, estimate characteristics specific to individual species.

Recent years have witnessed the development of an infrastructure in the form of DNA chips and DNA microarrays (hereinafter referred to as ‘biochips’). As a result, the interests of molecular biologists are turning from inter-species data to intra-species data; in other words, they are focusing on the analysis of genes expressed simultaneously in a particular cell. Thus, there is an increasing number of ways in which data is extracted and used, alongside the more conventional comparisons between species.

For instance, if a previously unknown gene is discovered and found to exhibit the same expression pattern as a known gene, it may be inferred to have a similar function to that of the known gene. The functional significance of such genes and the proteins themselves are being studied in the form of functional units and groups. Meanwhile, as far as interactions between them are concerned, the direct and indirect effect of a given gene is being analysed by comparison with known enzyme reaction data or metabolic data, or more directly by destroying the gene or causing it to overreact, thus eliminating the expression thereof or expressing it in quantity to study the expression patterns of all genes. An example of success in this field is provided by an expression analysis of yeast performed by a group led by P. Brown of Stanford University in the USA (Michael B. Eisen et al.; Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95(25); 14863–8, 8 Dec. 1998). This group used DNA microarrays to hybridise genes extracted from cells in a time series, representing the degree of gene expression (intensity of hybridised fluorescent signals) numerically. By allocating colours to the numerical values they then displayed the expression processes of the individual genes in a manner which was easy to understand. They then clustered genes with similar expression pattern processes in a cycle of cells (those with similar degrees of expression at a given point in time).

FIG. 27 is an example of how the result of a standard cluster analysis of gene expression is displayed according to this method. The experiment cases are displayed in the horizontal direction, and the genes arranged in the vertical direction. The degree of expression of each gene in each experiment case is denoted by the density of colour. Denser colours represent higher degrees of expression. A dendrogram is displayed on the left of the drawing. The dendrogram shows how in the process of clustering two closest clusters have been merged in each case, while the length of each branch corresponds to the relative distance between two clusters on merging.

FIG. 28 is an example of another display representing the similarity of gene expression patterns. Observed data on individual genes is arranged on the right, while the dendrogram displayed on the left has been prepared on the basis of these gene expression patterns.

With developments in biology the functions of genes are gradually being clarified, and biologists are attempting to analyse them by combining expression data and known information. Analysis by dendrogram allows biologists to look for biologically meaningful clusters (groups of genes). In other words, if the expression patterns of individual genes in a cluster are similar and many of those genes have a known function in common, the cluster is extracted as a meaningful cluster. Such clusters are herein referred to as function clusters. Vertical bars 2801 and 2802 in FIG. 28 are examples of such function clusters. For instance, if there is a gene of unknown function within a function cluster, it is possible to infer that it possesses a similar function to those in the same cluster with a known function. What is more, by examining the expression pattern of a function cluster it is possible to discover the expression process specific to the function.

A huge amount of gene data needs to be handled in the actual analysis of gene expression patterns. This is because biochips make it possible to observe genes of the order of between several thousand and several tens of thousands at the same time. With developments in biochip technology the number of genes which it is possible to observe simultaneously is set to increase by leaps and bounds, lending powerful support to the process of clarifying the mechanism of life.

As the number of such genes increases in this manner, it becomes extremely difficult to comprehend the workings of all of them. A dendrogram will contain thousands or tens of thousands of genes, and even the subtrees illustrated in FIGS. 27 and 28 will be very complex and include many fine branches, making it difficult to decide what sort of classification has been carried out.

Researchers will have to spend much time and effort choosing function clusters for these dendrograms. Some commercially available gene expression clustering tools have display functions for dendrograms and gene names, but none gives any suggestion as to what clusters merit attention.

In view of the above problems with conventional technology, it is a first object of the present invention to take the results of clustering, extract from them groups of genes having the same function and genes having similar expression patterns to the groups of genes, and provide a function and display for re-analysing these genes. This makes it possible to assist in discovering specific expression patterns for gene functions, surmise unknown gene functions, and infer whether or not genes known to have one function have other functions as well.

It is a second object of the present invention to provide a means of automatically sorting clusters of genes having similar expression patterns and the same function, and displaying them in a form in which it is easy for researchers to understand their characteristics.

DISCLOSURE OF THE INVENTION

In order to achieve the first object, the method of displaying gene data to which the present invention pertains comprises the steps of displaying a plurality of gene expression patterns and a dendrogram obtained by cluster analysis of those expression patterns in such a manner as to correspond to each other; specifying the function of a particular gene and the distance thereof on the dendrogram; and highlighting that subtree in the dendrogram which contains the gene with the specified function and which has, as a root, a node whose distance from the gene on the dendrogram is less than the specified distance.

This method of displaying gene data does so in a form which facilitates the visual appreciation of a plurality of gene expression pattern data and permits easy conjecture of the function and role of a gene. It achieves this by first clustering according to gene expression data, and, on a dendrogram showing the results, highlighting the branches which correspond to gene groups having the same functions and those exhibiting similar expression patterns to them, thus allowing the user to comprehend the position of these genes in the dendrogram as a whole.

The distance from a gene on the dendrogram may be specified by drawing a straight line crossing branches of the dendrogram.

This method of displaying gene data may further comprise the step of extracting and displaying only the highlighted subtree and the expression pattern of a gene corresponding to the highlighted subtree.

The method may further comprise the step of performing cluster analysis on the extracted expression patterns.

Moreover, the method may further comprise the steps of specifying a range within which to perform cluster analysis on the extracted expression patterns, and performing cluster analysis on the expression patterns within the specified range.

In order to achieve the second object, a method of displaying gene data to which the present invention pertains comprises the steps of displaying a dendrogram obtained by performing cluster analysis on a plurality of gene expression patterns; specifying the function of genes to be cluster-extracted and a condition for cluster extraction; and highlighting gene clusters which satisfy the conditions in units of subtree of the dendrogram.

This method of displaying gene data does so in a form which facilitates the visual appreciation of a plurality of gene expression pattern data and permits easy conjecture of the function and role of a gene. It is capable of automatically extracting and displaying clusters where large numbers of genes exhibiting similar expression patterns and having known functions are gathered.

The condition for extracting clusters may comprise a minimum proportion of genes having the specified function within the subtree, and a minimum number of genes in one cluster that have the specified function.

Moreover, in order to achieve the second object, a method of displaying gene data to which the present invention pertains comprises the steps of displaying a dendrogram obtained by performing cluster analysis on a plurality of gene expression patterns; selecting a subtree from the dendrogram; and displaying proportions of genes contained within the selected subtree by function.

Selecting a subtree from the dendrogram obtained by performing cluster analysis and displaying it in detail allows the user to understand what sort of gene functions are gathered there, and helps infer unknown gene functions.

Furthermore, in order to achieve the second object, a method of displaying gene data to which the present invention pertains may comprise the steps of displaying a dendrogram obtained by performing cluster analysis on a plurality of gene expression patterns; selecting a subtree from the dendrogram; and displaying on a graph an average expression pattern of the selected subtree.

Selecting a subtree from the dendrogram obtained by performing cluster analysis and displaying expression patterns in detail allow the user to understand what sort of expression patterns are specific to functions. It is also possible to display variance alongside average expression values.

The recording medium capable of being read by a computer to which the present invention pertains is such that a computer program for implementing a plurality of steps according to any of the above methods is recorded thereon.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example of a screen display according to the present invention (screen display wherein genes having the same function and genes with similar expression patterns to them are highlighted);

FIG. 2 illustrates an example of a screen display according to the present invention (screen display wherein only genes having the same function and those with similar expression patterns to them have been extracted and displayed);

FIG. 3 illustrates an example of a screen display according to the present invention (screen display when a different method of clustering has been applied to the data in FIG. 2);

FIG. 4 illustrates an example of a screen display according to the present invention (screen display when further clustering has been applied to the data in FIG. 3 after removing a similar expression pattern region);

FIG. 5 is a system configuration diagram of the present invention;

FIG. 6 illustrates an example of gene data;

FIG. 7 illustrates an example of a gene function name list;

FIG. 8 illustrates an example of a cluster structure;

FIG. 9 illustrates an example of storing a function list in the cluster structure;

FIG. 10 illustrates an example of generating a cluster tree structure;

FIG. 11 illustrates an example of storing functionally related genes;

FIG. 12 illustrates an example of the range of data to which clustering is applied;

FIG. 13 illustrates a schematic processing flow of the present system;

FIG. 14 illustrates a detailed flow of cluster analysis;

FIG. 15 illustrates the flow of processing to extract a gene related to a function name;

FIG. 16 illustrates a model of cluster extraction processing;

FIG. 17 illustrates an example of a screen display in accordance with the present invention (screen displaying a dendrogram and function cluster);

FIG. 18 illustrates an example of a screen display in accordance with the present invention (screen displaying subtree data);

FIG. 19 illustrates an example of a cluster structure;

FIG. 20 illustrates an example of storing a function list in a cluster structure;

FIG. 21 illustrates an example of generating cluster tree structures;

FIG. 22 illustrates an example of a structure for storing the results;

FIG. 23 illustrates a schematic flow of the present system;

FIG. 24 illustrates a detailed flow of cluster analysis (cluster tree generation);

FIG. 25 illustrates a detailed flow of cluster analysis (automatic cluster extraction);

FIG. 26 illustrates a detailed flow of cluster extraction processing (process A);

FIG. 27 illustrates an example of how the results of standard cluster analysis are displayed; and

FIG. 28 illustrates an example of how the results of standard cluster analysis are displayed.

BEST MODE FOR CARRYING OUT THE INVENTION

There follows, with reference to the appended drawings, a more detailed description of the present invention.

First Embodiment

To begin with, there follows a description of an embodiment aimed at achieving the first object of the present invention.

FIG. 1 illustrates an example of a screen display aimed at achieving the first object of the present invention, where genes having the same function and genes with similar expression patterns to them are highlighted. If the user selects one function of a gene, the system searches the dendrogram for genes having that function and genes with similar expression patterns to them, and highlights them. Genes having the selected function are those with a mark numbered 101. Genes without this mark have different or unknown functions.

By similar expression pattern is meant that the distance between the clusters is small, which is to say the branch on the dendrogram is short. A threshold value is placed on the distance, and genes in a subtree whose distance is less than the threshold value are regarded as having similar expression patterns. In the dendrogram illustrated in FIG. 1, the vertical broken line 100 represents the threshold value and crosses several (more than two) branches of the dendrogram. In the example illustrated, genes which share the same subtree from this line 100 as far as the leaves are regarded as having similar expression patterns, and the relevant branches of the tree pattern are highlighted.

By highlighting effectively those groups of genes which have the same function and genes with similar expression patterns to them, this method of display makes it possible to see at a glance what positions they occupy on the dendrogram as a whole. Gene groups of this sort will be referred to as functionally related genes.

FIG. 2 is a screen display in which only the genes highlighted in FIG. 1 are displayed. In other words, only those genes having the same function and those with similar expression patterns to them are extracted and displayed. As shown in FIG. 2, gathering functionally related genes which have hitherto been dispersed over a dendrogram together for display allows users to infer the expression pattern specific to a function. In the display example illustrated in FIG. 2 there is an expression pattern common to each gene in the range 200 of part of the experiment case (horizontal axis), and it can be inferred that this range 200 may comprise a pattern specific to a function.

FIG. 3 illustrates an example of a screen display when a different method of clustering has been applied to the data in FIG. 2. Meanwhile, FIG. 4 illustrates an example of a screen display when, after removing from FIG. 3 the expression pattern (300) inferred to be specific to the function, further clustering has been applied to the remaining data. By reanalysing functionally related genes in this manner it is possible to help surmise unknown gene functions, and infer whether or not the genes have other functions.

FIG. 5 is a system configuration diagram of the present invention. This system comprises gene data 501 which records gene information and expression processes, a clustering processor 500 which performs clustering in accordance with the gene expression processes and analyses it in order to display it in the form of a dendrogram, a display device 502 for displaying the dendrogram, a keyboard 503 and mouse 504 for inputting values into the system and operating selections, and a gene function name list 505 which is used for automatically extracting function clusters. The clustering processor 500 comprises a computer and its program. The program can be recorded on a CD-ROM or other recording medium, and is loaded by being read by the computer. Alternatively it may be downloaded from another computer by way of a network. The gene data may be obtained from a database which is managed by a server computer located at a distance, instead of using the data stored in the memory device 501.

FIG. 6 illustrates the detailed structure of the gene data 501. Gene information is stored in a sequence of an m number of elements called gene[i] (i=1, 2, . . . m), where m is the number of genes included in the gene data. Gene data comprises gene ID (600) which uniquely identifies the gene, attribute information (601) which represents the gene, and expression data (602) obtained from a DNA chip, DNA microarray or other biochip. Attributes representing the gene include, e.g., gene name (603), ORF (604) and gene function (605). Other gene attributes may be defined as members of a gene information structure. In expression data (602), there is stored numerical data on the degree of expression of the gene in each experiment (intensity of fluorescent signals after hybridisation reaction). In the present embodiment, the number of experiments is n, and expression data for one gene is treated as an n-dimensional vector.
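The gene information structure described above can be pictured with a minimal sketch such as the following. The field names, the `load_genes` helper and the sample records are illustrative assumptions, not taken from FIG. 6; only the reference numerals in the comments come from the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    gene_id: int             # gene ID (600)
    name: str                # gene name (603)
    orf: str                 # ORF (604)
    functions: List[str]     # gene function (605); a gene may have several
    expression: List[float]  # expression data (602), one value per experiment


def load_genes(rows):
    """Build the gene[] sequence and check that every vector has the same n."""
    genes = [Gene(*row) for row in rows]
    if genes:
        n = len(genes[0].expression)
        for g in genes:
            if len(g.expression) != n:
                raise ValueError(f"gene {g.gene_id}: expected {n} values")
    return genes


# Hypothetical records: m = 2 genes, n = 3 experiments.
genes = load_genes([
    (1, "geneA", "ORF-A", ["TRANSPORT"], [0.2, 1.4, 0.9]),
    (2, "geneB", "ORF-B", ["UNKNOWN"], [0.1, 1.3, 1.1]),
])
```

Checking the vector length on load is a design choice of this sketch; it simply enforces the statement that every gene carries an n-dimensional expression vector.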

FIG. 7 illustrates the concrete structure of the gene function name list (505). The gene function name list is represented by a sequence funcList[ ] comprising a number func_num of elements. Names of functions are contained within the sequence. The index of the sequence funcList[ ] is represented as funcIdx, and is treated as an ID corresponding to the function. Genes with unknown functions are also recorded in the funcList[ ] with, for instance, the function name ‘UNKNOWN’.

FIG. 8 illustrates an example of a cluster structure used in clustering. Each cluster structure corresponds to a node or leaf of the dendrogram. In order to identify the individual clusters, each cluster structure is uniquely represented by the pair of a clusterNo (800) and a clusteringID (806). The clusteringID (806) is an ID uniquely determined for each method of clustering, while the clusterNo (800) is used as an ID to represent the node in one clusteringID.

There are two types of cluster structure, i.e., one for leaf (left) and another for node (right), corresponding to clusters representing leaves and those representing medial nodes respectively, denoted by the value of the type member (801). The node-type cluster structures are generated successively in merge processing during clustering. Two clusters prior to merging can be traced from the left (802) and right (803) values, and the distance ((dis)similarity) between them is retained as the distance (804) value. The clusterNo (800) is included in the left and right values. The leaf type cluster structures, on the other hand, each correspond to one gene, and gene information structure data can be referred to by storing GENE ID (600) in geneID (805).

In the case of node type cluster structures, the functions of the genes corresponding to leaf type clusters which belong to clusters are stored by type in the leafFuncList (807) in a list structure. In the case of leaf type cluster structures, the functions of corresponding genes are stored in leafFuncList (807) in a list structure. One list comprises an idx (808) for storing the function ID, and a NextPtr (809) for storing a pointer to the next list. The function ID which goes in idx is the index of the funcList in the gene function name list. If a gene has a plurality of functions, these are added in the leafFuncList. For instance, if a gene has the functions ‘TRANSPORT’, ‘TCA CYCLE’ and ‘GLYCOLYSIS’, the leafFuncList comprises three lists.

FIG. 9 illustrates an example of the function list leafFuncList (807) stored in the cluster structure. The function name of each gene is given on the right of the dendrogram. Since the gene of the leaf linked to the node 900 has the functions ‘UNKNOWN (funcIdx: 1)’, ‘TRANSPORT (funcIdx: 2)’, ‘GLYCOLYSIS (funcIdx: 3)’ and ‘TCA CYCLE (funcIdx: 4)’, the leafFuncList of the cluster structure is expressed in the form illustrated in the drawing.
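As a rough illustration of the cluster structure and its leafFuncList, the sketch below models both leaf-type and node-type clusters with one Python class. It makes two simplifying assumptions: the linked list of idx/NextPtr cells is replaced by a plain Python list, and a node's list is taken to be the duplicate-free union of its children's function IDs. The leaves, their gene IDs and the distances are hypothetical; only the four funcIdx values are those quoted for FIG. 9.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# funcIdx values as quoted in the text for FIG. 9.
FUNC_LIST = {1: "UNKNOWN", 2: "TRANSPORT", 3: "GLYCOLYSIS", 4: "TCA CYCLE"}


@dataclass
class Cluster:
    cluster_no: int                    # clusterNo (800)
    clustering_id: int                 # clusteringID (806)
    type: str                          # type (801): "leaf" or "node"
    leaf_func_list: List[int] = field(default_factory=list)  # leafFuncList (807)
    gene_id: Optional[int] = None      # geneID (805), leaf only
    left: Optional["Cluster"] = None   # left (802), node only
    right: Optional["Cluster"] = None  # right (803), node only
    distance: Optional[float] = None   # distance (804), node only


def merge_nodes(cluster_no, left, right, distance):
    """Create a node-type cluster whose function list covers both children."""
    funcs = sorted(set(left.leaf_func_list) | set(right.leaf_func_list))
    return Cluster(cluster_no, left.clustering_id, "node", funcs,
                   left=left, right=right, distance=distance)


# Hypothetical leaves; a gene with several functions lists several IDs.
leaf1 = Cluster(1, 1, "leaf", [1], gene_id=101)
leaf2 = Cluster(2, 1, "leaf", [2, 3], gene_id=102)
leaf3 = Cluster(3, 1, "leaf", [3, 4], gene_id=103)
node4 = merge_nodes(4, leaf1, leaf2, 0.3)
node5 = merge_nodes(5, node4, leaf3, 0.7)
print([FUNC_LIST[i] for i in node5.leaf_func_list])
```

Keeping the function list on every node means a later search can test whether a subtree contains a given function without descending to its leaves.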

FIG. 10 illustrates the data structure generated in the process of cluster analysis. At first only leaf-type cluster structures are prepared, but in the process of cluster analysis they are merged two by two, generating a node-type cluster structure each time to assemble a tree structure. These link structures are managed separately by clusteringID (806). This is because the clusteringID is determined by the method of clustering, and the tree structure changes if the method of clustering does.

FIG. 11 illustrates an example of a sequence extractNodes[ ] comprising a number en_num of elements for storing functionally related genes. Here are stored the clusterNos of root nodes of subtrees of groups of genes regarded as functionally related genes. For instance, as may be seen from FIG. 11, when genes with a given function are in positions 1100 and the threshold value which determines the similarity of expression is set on the dendrogram in the position shown by the broken line 1101, the clusterNos of nodes severed by the line 1101 determining the threshold value are stored in extractNodes[ ].

FIG. 12 illustrates an example of a sequence clustering_dims[ ] comprising a number dim_num of elements for the purpose of storing information on the range of data to which clustering is applied. By the range of data to which clustering is applied is meant the range of suffix number (corresponding to each experiment) when the n-dimensional gene expression vector data is represented by (x₁, x₂, . . . , xₙ). The suffix number corresponding to the object data is stored in the sequence clustering_dims[ ]. For instance, as is shown in FIG. 12, when there are expression data from experiment 1 to experiment 11, and the range from expression data 5 to 7 designated 1200 is excluded from the data to which clustering is applied, the details of the sequence clustering_dims[ ] are as illustrated in the drawing.
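The example in the text (experiments 1 to 11, with 5 to 7 excluded) can be worked through with a short sketch. The function names `build_clustering_dims` and `project` are hypothetical, and the resulting contents of clustering_dims[ ] are inferred from the description rather than read from FIG. 12.

```python
def build_clustering_dims(n, excluded):
    """Return the 1-based suffix numbers of the experiments kept for clustering."""
    excluded = set(excluded)
    return [i for i in range(1, n + 1) if i not in excluded]


def project(expression, clustering_dims):
    """Keep only the vector components named in clustering_dims (1-based)."""
    return [expression[i - 1] for i in clustering_dims]


# Experiments 1..11 with the range 5..7 (designated 1200) removed.
clustering_dims = build_clustering_dims(11, range(5, 8))
dim_num = len(clustering_dims)

# Hypothetical expression vector for one gene, one value per experiment.
vector = [0.1 * k for k in range(1, 12)]
reduced = project(vector, clustering_dims)
```

Storing suffix numbers rather than a start and end index allows non-contiguous ranges, which is what the exclusion of a middle block requires.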

FIG. 13 illustrates an outline processing flow of the gene cluster method according to the present embodiment.

Firstly, data is read from the gene expression pattern data into the clustering processor 500 (step 1300). Next, 1 is substituted in clustID, which is an ID representing the method of clustering, and 1, 2, 3, . . . , n are substituted in the clustering applied data region clustering_dims[ ], in the order from the initial element, for initialisation. Then m is substituted in gnum, which shows the total number of data for clustering (step 1301). Next, the parameters required for cluster analysis are set (step 1302).

Once the parameters have been initialised and set, cluster analysis is performed (step 1303). This will be described in detail below. Next, the results of the analysis are displayed (step 1304). Here, the data for display which has previously been collected and calculated (relative distance between clusters) is employed to create a dendrogram and display gene names and functions.

If genes having the same function are to be displayed within the dendrogram at this point, a threshold value is set to show the degree of similarity in expression patterns, and a desired function name is selected (steps 1306, 1307). The threshold value may be set by selecting an appropriate value from the display of clustering results (for instance by moving the mouse right and left along the line 100 of threshold values shown in FIG. 1). If genes having the same function are not to be displayed in step 1305, the process terminates.

Next, genes with the function name selected in step 1307 and those having similar expression patterns to them are extracted (step 1308) by using as an argument the cluster corresponding to the root of the dendrogram that has just been generated. This will be described in detail below. After this process, branches corresponding to the functionally related genes are highlighted as denoted by the thick line in FIG. 1 (step 1309) on the basis of the information about the clusterNo of the subtree root of the extracted genes (functionally related genes) stored in the sequence extractNodes[ ].

If the user wishes to focus on a function name other than that selected at step 1307, the process returns to step 1306 and continues (step 1310). If not, the dendrogram is re-displayed with only the extracted genes (functionally related genes) as in FIG. 3 (step 1311).

If further clustering is to be performed on the group of functionally related genes, the following processing is performed. Firstly, if it is desired to apply clustering after narrowing the range of data to which clustering is to be applied, the sequence clustering_dims[ ] is renewed. In other words, as may be seen from FIG. 12, the suffix number of the gene expression vector data for clustering is written in clustering_dims[ ]. This suffix number for clustering can also be written by specifying the range on screen with the mouse or otherwise. Then, the clustering method ID clustID and the clustering applied data range are re-set. First, clustID is incremented by 1. The data which is to be read in the clustering process as the clustering applied data range is replaced with the gene group of the functionally related genes extracted at step 1308, and the number of functionally related genes is substituted in gnum, which indicates the total number of data which are to be clustered. Then the program returns to step 1302, and clustering is performed. In step 1312 processing is terminated if no more clustering is applied.

FIG. 14 illustrates the detailed flow of the process of cluster analysis in FIG. 13 (step 1303).

Firstly, in the n-dimensional vector (602) formed of expression data corresponding to each gene ID as shown in FIG. 6, the components of that vector whose suffix numbers correspond to the sequence clustering_dims[ ] are taken and new vector data corresponding to the genes is created, where the new vector consists only of the components taken. Leaf-type cluster structures corresponding to a number gnum of individual genes are generated and registered as clusters for merger (step 1400). To the clusterNo member values (800) of the cluster structure are allocated 1, 2, 3, . . . , in the order of the gene data that is fed. The gene ID (600) is registered in the geneID member value (805), the clustID is registered in the clusteringID member value (806), and the function of the corresponding genes is registered in leafFuncList member value (807).

Next, the number of clusters to be merged cnum and the number of node-type cluster structures ncls are initialised to gnum and 0, respectively (step 1401). It is then determined whether the number of clusters to be merged cnum is equal to 1 (step 1402). If not, the following procedure is repeated until it becomes 1. If it is equal to 1, the process is terminated.

Firstly, the two clusters with a minimum relative distance from the registered clusters which are to be merged are selected (step 1403). Next, a node-type cluster C is newly generated (step 1404), and the number of node-type clusters is incremented (step 1405). The two clusters selected at step 1403 and the distance between them are registered in the left member (802), right member (803) and distance member (804) of the new node-type cluster, and leafFuncLists of the two clusters are added in the leafFuncList member (807). In addition, clustID is registered in the clusteringID member (806), and gnum+ncls is registered in the clusterNo member (800) of C (step 1406).

It is also possible to establish assessment criteria in advance as to which of the two clusters is made the left member and which the right. Finally, these two clusters are excluded from the clusters for merger, the new node-type cluster is registered (step 1407), the value of the number of clusters for merger cnum is decremented (step 1408), and the process is continued from step 1402.
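Steps 1400 to 1408 amount to a bottom-up (agglomerative) merge loop, which may be sketched as follows. The description does not fix the distance measure, so this sketch assumes Euclidean distance with average linkage; the dictionary keys mirror the member names in the text, while `members` is a helper field added here only to compute inter-cluster distances. The input vectors and function IDs are hypothetical.

```python
import math
from itertools import combinations


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_analysis(vectors, funcs, clust_id=1):
    """Merge clusters pairwise until one root remains; return the root."""
    gnum = len(vectors)
    if gnum == 0:
        raise ValueError("no genes to cluster")

    def dist(a, b):
        # Assumed average linkage: mean distance over all member pairs.
        pairs = [(p, q) for p in a["members"] for q in b["members"]]
        return sum(euclid(vectors[p], vectors[q]) for p, q in pairs) / len(pairs)

    # Step 1400: one leaf-type cluster per gene, clusterNo = 1, 2, 3, ...
    active = [{"clusterNo": i + 1, "clusteringID": clust_id, "type": "leaf",
               "geneIdx": i, "members": [i],
               "leafFuncList": sorted(set(funcs[i]))} for i in range(gnum)]
    cnum, ncls = gnum, 0                       # step 1401
    while cnum != 1:                           # step 1402
        a, b = min(combinations(active, 2), key=lambda ab: dist(*ab))  # 1403
        ncls += 1                              # steps 1404-1405
        node = {"clusterNo": gnum + ncls, "clusteringID": clust_id,
                "type": "node", "left": a, "right": b,
                "distance": dist(a, b),
                "members": a["members"] + b["members"],
                "leafFuncList": sorted(set(a["leafFuncList"]) | set(b["leafFuncList"]))}  # 1406
        active = [c for c in active if c is not a and c is not b] + [node]  # 1407
        cnum -= 1                              # step 1408
    return active[0]


# Hypothetical data: three genes, two experiments, function IDs per gene.
root = cluster_analysis([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]], [[3], [2], [3, 4]])
```

The exhaustive pair search is quadratic in the number of active clusters at every iteration, which is acceptable for a sketch but would need a cached distance matrix for the tens of thousands of genes mentioned earlier.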

FIG. 15 illustrates the detailed flow of processing to extract the gene related to the function name in FIG. 13 (step 1308).

First, the type member value of the cluster given as the argument is examined. If it is leaf, the process is terminated (step 1500). Next, the right member cluster (Cr) of the cluster given as an argument is examined to see whether or not it contains a function-related gene. In other words, it is examined to see whether or not the function ID of the function name selected at step 1307 in FIG. 13 is included in the list of Cr leafFuncList (807). If it is not included, the process is terminated (step 1501).

If the function corresponding to the cluster Cr is included, it is examined to see whether or not the Cr distance member (804) is smaller than the threshold value determined at step 1306 in FIG. 13 (step 1502). If it is smaller, the Cr clusterNo member value (800) is registered in the sequence (extractNodes[ ]) where the function-related genes are stored (step 1503). If the distance member is greater than the threshold value at step 1502, the genes related to the function name are again extracted with the cluster Cr as the argument (process illustrated in FIG. 15).

The same process is performed on the left member cluster of the cluster given as the argument, and the process is terminated (steps 1505–1508).
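The recursive extraction of FIG. 15 can be sketched as below, under stated assumptions. First, "the process is terminated" for the right member is read as ending only the handling of that branch, after which the left member is examined. Second, a leaf is given a distance of 0, so a leaf carrying the selected function is registered on its own; the description does not say how leaves are treated. The class, the tree and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    cluster_no: int
    leaf_func_list: List[int]
    left: Optional["Cluster"] = None
    right: Optional["Cluster"] = None
    distance: float = 0.0  # assumption: leaves carry distance 0

    @property
    def is_leaf(self):
        return self.left is None and self.right is None


def extract_related(cluster, func_id, threshold, extract_nodes):
    """Append to extract_nodes the clusterNo of each subtree root that contains
    func_id and whose merge distance is below the threshold."""
    if cluster.is_leaf:                               # step 1500
        return
    for child in (cluster.right, cluster.left):       # right first, then left
        if func_id not in child.leaf_func_list:       # step 1501
            continue
        if child.distance < threshold:                # step 1502
            extract_nodes.append(child.cluster_no)    # step 1503
        else:
            extract_related(child, func_id, threshold, extract_nodes)


# Hypothetical dendrogram: leaves 1-4, nodes 5-7, function ID 3 selected.
l1, l2 = Cluster(1, [3]), Cluster(2, [2])
l3, l4 = Cluster(3, [3]), Cluster(4, [1])
n5 = Cluster(5, [2, 3], left=l1, right=l2, distance=0.2)
n6 = Cluster(6, [1, 3], left=l3, right=l4, distance=0.9)
root = Cluster(7, [1, 2, 3], left=n5, right=n6, distance=1.5)

extract_nodes = []
extract_related(root, func_id=3, threshold=0.5, extract_nodes=extract_nodes)
```

As in the described flow, the cluster passed as the argument is itself never registered; only its descendants are. With the values above, node 6 is too loose (0.9) and is descended into, whereas node 5 (0.2) is registered as a whole.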

In this manner it is possible to display and analyse the results of cluster analysis as illustrated in FIGS. 1–4.

Second Embodiment

There now follows a description of an embodiment aimed at achieving the second object of the present invention. The system of the present embodiment is configured in the same manner as in FIG. 5. Moreover, the gene data and gene function name list used are the same as those illustrated in FIGS. 6 and 7 and described in relation to the first embodiment.

The present embodiment calculates how many of the genes belonging to a subtree have each function, and determines the proportion of each function in the subtree. If the proportion of a function in the subtree exceeds a previously determined threshold value, the subtree is regarded as a function cluster and extracted. In order to prevent a single gene from being regarded as a function cluster by itself, the minimum number of genes that a cluster must contain is also determined in advance as a threshold value.

FIG. 16 illustrates a model of cluster extraction processing. The process illustrated is that of searching for function clusters relating to genes with GLYCOLYSIS as their function. In this example, as shown on the right of FIG. 16, the minimum proportion having the function of GLYCOLYSIS in the subtree is set at 0.40, and clusters containing at least three genes are to be selected.

In the case of the example illustrated in FIG. 16, it can be seen that, as to the root node 1600 of the dendrogram, the number of genes having the function GLYCOLYSIS is five, so that the proportion of such genes to the total number of genes in the subtree (17) is 5/17=0.29. This is smaller than the minimum proportion (0.40) set as the threshold value, and therefore the subtree of node 1600 is not regarded as a function cluster related to the function GLYCOLYSIS.

Next, the proportion of genes having the function GLYCOLYSIS is calculated in the same manner for the two sub-nodes 1601 and 1602 belonging to the node 1600. When these nodes are considered as roots, the proportions are 0.00 and 0.36, respectively, so that the nodes 1601 and 1602 are not regarded as function clusters in relation to the function GLYCOLYSIS. As far as node 1601 is concerned, if the sub-nodes to the left and right are regarded as roots of a subtree, the numbers of genes are two and one, respectively, which does not fulfil the condition of selecting a cluster with at least three genes. Thus the search is not continued.

The proportion of genes having the function GLYCOLYSIS is calculated in the same manner for the sub-nodes 1603 and 1604 on the left and right of the node 1602. In node 1604 the proportion of genes having the function GLYCOLYSIS is 0.44, which is higher than the proportion determined according to the threshold value. As a result, this is regarded as a function cluster. On the other hand, the node 1603 and its sub-node 1605 have proportions of GLYCOLYSIS lower than the threshold value, and are therefore not regarded as constituting function clusters. The function clusters are determined in this manner.
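The arithmetic of the FIG. 16 example can be checked with a short sketch. The function name is_function_cluster and its parameters are hypothetical. The gene count of 14 for node 1602 is not stated in the text but is inferred from it: node 1601 holds 2 + 1 = 3 of the 17 genes and none of the five GLYCOLYSIS genes.

```python
def is_function_cluster(func_count, total_genes, min_rate, min_genes):
    """Return True when a subtree satisfies both threshold values:
    enough genes, and a proportion of the function above min_rate."""
    if total_genes < min_genes:
        return False
    return func_count / total_genes > min_rate


# Root node 1600: 5 of 17 genes have the function GLYCOLYSIS.
print(round(5 / 17, 2), is_function_cluster(5, 17, 0.40, 3))   # 0.29 False

# Node 1602: 17 - 3 = 14 genes, still 5 with GLYCOLYSIS (inferred).
print(round(5 / 14, 2), is_function_cluster(5, 14, 0.40, 3))   # 0.36 False
```

Both subtrees fall below the minimum proportion of 0.40, which is why the search descends to the sub-nodes 1603 and 1604.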

FIG. 17 is an example of screen display according to the present embodiment.

Function clusters are shown by drawing vertical bars beside the dendrogram. There are cases, as with 1701 and 1702, where the bars overlap. This is because the genes have a plurality of functions, and the function cluster parts are displayed for both functions. The function clusters only need to be highlighted in such a manner as to be distinguishable from other parts, and there are other methods of achieving this apart from drawing bars. Examples include changing the colour of the clusters and surrounding them with a frame.

FIG. 18 is another example of a screen display in accordance with the present embodiment. In this example, selecting a branch of a subtree with the mouse causes the proportions of gene functions contained there to be displayed on a circle graph. Further, the average and statistical variance of expression patterns of individual genes belonging to the subtree selected with the pointer 1801 are calculated and displayed on the graph, where the horizontal axis shows the experiment case, for example time. Adopting this method of display particularly in relation to function clusters is useful for inferring the function of genes of which the function is not yet known, and also makes it possible to discover expression patterns specific to functions.

FIG. 19 illustrates an example of cluster structure used in the clustering process in the present embodiment. Each cluster structure corresponds to a node or leaf on the dendrogram. In order to identify each cluster, a unique clusterNo (1900) is allocated to each cluster structure. There are two types of cluster structure. These correspond to clusters representing leaves and those representing intermediate nodes, and are divided by the type member (1901) value respectively into leaf (left) and node (right).

The node-type cluster structures are generated successively in the process of merging during clustering, and the two clusters prior to merging can be traced from the left (1902) value and right (1903) value. Moreover, the distance ((dis)similarity) between them is retained as the distance (1904) value. A unique clusterNo (1900) representing the cluster is entered in left and right.

The leaf-type cluster structures each correspond to one gene, and data on gene information structures can be referred to by storing the gene ID (600) in geneID (1905). In the case of node-type cluster structures, the number of leaf-type structures belonging to the cluster is stored in leafnum (1906), while the functions of genes corresponding to the leaf-type structures belonging to the cluster are stored by type in leafFuncList (1907) in a list structure. In the case of leaf-type cluster structures, 1 is stored in leafnum (1906), while the function of the corresponding gene is stored in leafFuncList (1907) in a list structure.

One list comprises idx (1908) for storing function ID, Num (1909) showing the number of times that function appears in the subtree, and NextPtr (1910) for storing the pointer to the next list. The function ID stored in idx is the index of funcList in the gene function name list.

Where a gene has a plurality of functions, 1 is divided by the number of functions, and the number of times Num (1909) a function appears is represented as an equal fraction of 1, or alternatively each of the plurality of functions may be represented by 1. For instance, if a gene has the functions ‘TRANSPORT’, ‘TCA CYCLE’ and ‘GLYCOLYSIS’, and the number of times a function appears is represented as an equal fraction of 1, funcList will comprise three lists, and Num will be 0.33 in each of them.
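The two counting conventions for a gene with several functions can be illustrated with a minimal sketch. The function name leaf_func_counts and the use of a dictionary in place of the linked list of idx, Num and NextPtr are assumptions made for brevity.

```python
def leaf_func_counts(functions, fractional=True):
    """Return the Num value stored for each function of a single gene.

    fractional=True  : each function receives an equal fraction of 1.
    fractional=False : each function is counted as 1.
    """
    if not functions:
        return {}
    weight = 1 / len(functions) if fractional else 1
    return {name: weight for name in functions}


counts = leaf_func_counts(["TRANSPORT", "TCA CYCLE", "GLYCOLYSIS"])
print({name: round(num, 2) for name, num in counts.items()})
```

With the fractional convention the Num values of one gene always sum to 1, so a gene with many functions does not weigh more heavily in the proportions of a subtree than a gene with one function.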

FIG. 20 illustrates an example of storing a function list leafFuncList (1907) in the cluster structure illustrated in FIG. 19. The function name of each gene is shown on the right of the dendrogram. Of the genes joined to node 2000, four have the function “UNKNOWN (funcIdx: 1)”, four “TRANSPORT (funcIdx: 2)”, seven “GLYCOLYSIS (funcIdx: 3)”, and one “TCA CYCLE (funcIdx: 4)”, so that the leafFuncList of the cluster structure is expressed in the form shown in the diagram.

FIG. 21 illustrates the data structure generated in the process of cluster analysis. At first only leaf-type structures are prepared, but in the process of cluster analysis they are merged two by two, generating a node-type cluster structure each time to assemble a tree structure. The node-type clusters create pointers in the order in which they are generated so that they can be traced from the sequence node_clusters[ ]. The variable nclus retains the total number of node-type cluster structures created.

FIG. 22 illustrates the sequence results[i] (i=1, 2, 3, . . . , func_num) of the structure for storing results. The index i of results[i] corresponds to each function ID (funcIdx). In other words, one results[ ] element is allocated to each function. The members of the structure results[ ] comprise threshold values and extracted results. The threshold values comprise threshold rate (2200), that is, the proportion of a function which should be contained in one subtree, and threshold leaf (2201), that is, the minimum number of leaf-type clusters which should be contained in the subtree. The extraction result is represented by result (2202). Here the clusterNo of the intermediate nodes (node-type clusters) representing function clusters is stored.

The threshold values can be set by the user by operating the keyboard or mouse. In particular, as to the threshold rate (2200), the same value may be given uniformly to all functions. Alternatively, if any one function accounts for a relatively large proportion of the genes from the beginning, the threshold rate may be varied accordingly. Several other ways may be contemplated.

FIG. 23 illustrates the outline process flow of the gene clustering method according to the present embodiment.

Firstly, data is read from the gene expression pattern data into the clustering processor 500 (step 2300). Next, the various parameters and threshold values required for cluster analysis are set (steps 2301, 2302). Once the parameters have been set, cluster analysis is performed (step 2303). During the process of cluster analysis the information required for function cluster display according to the present invention is collected and the data for use in display are calculated. This will be described in detail below.

Next, the results of the analysis are displayed (step 2304). The data for display which has been collected and calculated (relative distance between clusters) is used to create a dendrogram, and gene names and functions are displayed. The leaf nodes (leaf-type cluster structures) linked to the intermediate nodes (node-type cluster structures) specified by result members in the results[ ] sequence are indicated by bars such as those designated 1701 and 1702 in FIG. 17.

If a subtree is to be selected and displayed at this point, the distribution of gene functions of the leaf nodes included in the selected subtree is displayed as shown in FIG. 18, and an average expression pattern of the genes is displayed (steps 2305, 2306). For display, since the distribution of functions is stored in the leafFuncList (1907) of the intermediate node (node-type cluster) corresponding to the subtree selected, the function distribution may be created on the basis of the stored distribution, while the average and variance of the expression pattern may be created on the basis of the expression data (602) in the gene data sequence gene[ ] by tracing back to the leaf clusters. If no subtree is selected, the process terminates.
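The average and variance of the expression patterns in a selected subtree can be sketched as below. This is a hedged illustration: the function name expression_profile is hypothetical, each row is assumed to hold the expression data of one gene over the same ordered experiment cases, and the population variance is assumed because the text does not say which variance is used.

```python
from statistics import mean, pvariance


def expression_profile(expression_rows):
    """Return (averages, variances), one value per experiment case.

    expression_rows: one list of expression values per gene in the subtree.
    """
    columns = list(zip(*expression_rows))   # regroup values by experiment case
    averages = [mean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return averages, variances


# Two genes measured in two experiment cases (made-up values).
avg, var = expression_profile([[1.0, 2.0], [3.0, 6.0]])
print(avg, var)   # [2.0, 4.0] [1.0, 4.0]
```

Plotting the averages against the experiment case, with the variances as error bands, gives a graph of the kind described for FIG. 18.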

FIG. 24 illustrates the detailed flow of the process of cluster analysis (step 2303) in FIG. 23, and relates to the process of generating a cluster tree as the first stage.

Firstly, the m items of n-dimensional vector data (602) corresponding to each gene ID illustrated in FIG. 6 are taken as m leaf-type cluster structures and registered as clusters for merger (step 2400). The clusterNo is set to the gene[ ] index, the geneID (1905) is set to the GENE ID (600), leafnum (1906) is made 1, and the corresponding gene functions are added to leafFuncList (1907).

Next, the value cnum of the number of clusters for merger and the number nclus of node-type cluster structures so far created are initialised to m and 0, respectively (step 2401). It is assessed whether or not the number of clusters for merger is equal to 1 (step 2402). If it is not equal, the process outlined below is repeated until it becomes 1.

Firstly, the two clusters with the least relative distance among the registered clusters which are to be merged are selected (step 2403). Next, a new node-type cluster C is generated (step 2404), and the number of node-type clusters nclus is incremented (step 2405). The new node-type cluster is registered in the nclus-th component of the sequence node_clusters[ ] (step 2406). The two clusters selected at step 2403 and the distance between them are registered in the left member (1902), right member (1903) and distance member (1904) of the new node-type cluster, respectively, the sum of the two clusters' leafnum is registered in the leafnum member (1906), and the sum of the two clusters' leafFuncList is registered in the leafFuncList member (1907). m+nclus is registered in the clusterNo member (step 2407).

Here, it is possible to establish assessment criteria in advance as to which of the two clusters should be regarded as the left member and which as the right member. Finally, these two clusters are excluded from those destined for merger, the new node-type cluster is registered (step 2408), and the value of the cluster number cnum destined for merger is decremented (step 2409). If the value of cnum in the assessment at step 2402 is equal to 1, the procedure goes to the flow illustrated in FIG. 25.
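The tree-building stage of FIG. 24 can be sketched as follows. It is an illustrative sketch under stated assumptions: Euclidean distance with average linkage is assumed as the relative distance (the text leaves this to the parameters set before clustering), the members field is a helper that does not appear in the cluster structure of FIG. 19, a Counter stands in for the linked leafFuncList, and the first selected cluster is arbitrarily made the left member.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import combinations
from math import dist
from typing import List, Optional


@dataclass
class Cluster:
    cluster_no: int
    kind: str                              # "leaf" or "node"
    members: List[List[float]]             # helper: expression vectors below
    leafnum: int = 1
    leaf_func_list: Counter = field(default_factory=Counter)
    left: Optional[int] = None             # clusterNo of the left member
    right: Optional[int] = None            # clusterNo of the right member
    distance: float = 0.0


def linkage(a: Cluster, b: Cluster) -> float:
    """Average Euclidean distance between all leaf pairs (an assumption)."""
    total = sum(dist(x, y) for x in a.members for y in b.members)
    return total / (len(a.members) * len(b.members))


def build_tree(vectors, functions):
    m = len(vectors)
    # Step 2400: one leaf-type cluster per gene.
    clusters = {i + 1: Cluster(i + 1, "leaf", [vectors[i]], 1,
                               Counter(functions[i])) for i in range(m)}
    active = list(clusters)                # clusters for merger (cnum = len)
    nclus = 0
    while len(active) > 1:                 # step 2402
        a, b = min(combinations(active, 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        nclus += 1
        ca, cb = clusters[a], clusters[b]
        node = Cluster(m + nclus, "node", ca.members + cb.members,
                       ca.leafnum + cb.leafnum,
                       ca.leaf_func_list + cb.leaf_func_list,
                       left=a, right=b, distance=linkage(ca, cb))
        clusters[node.cluster_no] = node
        # Exclude the merged pair and register the new node-type cluster.
        active = [c for c in active if c not in (a, b)] + [node.cluster_no]
    return clusters, active[0]
```

Because leafnum and leafFuncList are summed at every merge, the proportion of any function in any subtree is available afterwards without traversing the tree again.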

FIG. 25 illustrates the detailed flow of cluster analysis (step 2303) in FIG. 23, the flow relating to automatic extraction of function clusters at a second processing stage.

Firstly, idx, which represents the index of the gene function name list, is initialised to 1 (step 2500). In the processes hitherto, C has been made into the root node of the dendrogram. All the genes belonging to the dendrogram are assessed to see whether or not the proportion of those whose function is funcList[idx] is greater than the proportion of the function which should be contained in the subtree (the threshold rate member value of results[idx]) (step 2501). If it is greater, the clusterNo member value of C is registered in the result member value of results[idx] (step 2502). If it is smaller, cluster extraction (process A) is performed with C and idx as arguments (step 2503). Process A will be described in detail below.

Then idx is incremented by 1. Steps 2501–2504 are performed until idx becomes func_num, i.e., for all the functions in the gene function name list (steps 2504, 2505). The whole process terminates when idx becomes func_num.

FIG. 26 illustrates the detailed flow of process A (step 2503) in FIG. 25.

First, the type member value of the cluster given as the argument is examined. If it is a leaf, the process is terminated (step 2600). Next, the right member cluster of the cluster given as the argument is examined to see whether or not it is a function cluster. First, the member value leafnum of the cluster (Cr) shown by the right member of the argument cluster is examined (step 2601) to see if it is greater than the minimum leaf number of the threshold value, i.e., the threshold leaf (2201) member value of results[idx]. If it is smaller, process A terminates.

If it is greater, the subtree with cluster Cr as the root is examined to see whether, of the genes belonging to that subtree, the proportion of the genes with the function funcList[idx] is greater than the threshold value. In other words, the number of functions corresponding to funcList[idx] in the leafFuncList (1907) of Cr is examined to see whether the value thereof divided by the leafnum (1906) is greater than the threshold rate member value (2200) of results[idx] (step 2602). If it is greater, the clusterNo member value of Cr is registered in the result member value of results[idx] (step 2603). If it is smaller, cluster extraction processing (process A) is performed with Cr and idx as arguments (step 2604).

Next, it is examined to see whether the left member cluster of the cluster given as the argument is a function cluster in the same manner as in steps 2601–2604. Process A terminates when all the above processing is over.
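The extraction of FIGS. 25 and 26 for a single function can be sketched as below. The sketch rests on several assumptions: the Python names are hypothetical, the result member is treated as a list of clusterNo values, a sub-cluster with fewer leaves than the threshold leaf value is skipped while its sibling is still examined, and the cluster registered on success is taken to be the sub-cluster that passed the test.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Cluster:
    cluster_no: int
    leafnum: int
    leaf_func_list: Dict[str, float] = field(default_factory=dict)
    left: Optional["Cluster"] = None
    right: Optional["Cluster"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def share(c: Cluster, func: str) -> float:
    """Proportion of genes in the subtree of c that have the function."""
    return c.leaf_func_list.get(func, 0) / c.leafnum


def process_a(c: Cluster, func: str, rate: float, min_leaves: int,
              result: List[int]) -> None:
    if c.is_leaf:                              # step 2600
        return
    for child in (c.right, c.left):            # right member, then left
        if child.leafnum < min_leaves:         # step 2601
            continue
        if share(child, func) > rate:          # step 2602
            result.append(child.cluster_no)    # step 2603
        else:
            process_a(child, func, rate, min_leaves, result)  # step 2604


def extract_function_clusters(root: Cluster, func: str, rate: float,
                              min_leaves: int) -> List[int]:
    """Outer test on the root (steps 2501-2503), then process A."""
    if share(root, func) > rate:
        return [root.cluster_no]
    result: List[int] = []
    process_a(root, func, rate, min_leaves, result)
    return result
```

The search stops descending as soon as a subtree qualifies, so the clusters reported for one function never nest inside each other.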

By means of the above processing it is possible to display the results of cluster analysis as illustrated in FIGS. 17 and 18.

INDUSTRIAL APPLICATION

As has been explained above, by highlighting a gene group having the same function and genes having expression patterns similar to the genes in that group on the basis of the results of clustering, the present invention makes it possible for one to comprehend where on a whole dendrogram those genes are located. Moreover, by extracting these genes and comparing their expression patterns, expression patterns specific to individual functions can be found. Furthermore, by performing a different method of clustering for cluster analysis on the extracted genes, for example, estimation of the function of genes with hitherto unknown functions and estimation about whether or not they have other functions can be facilitated.

Further, the present invention allows a group of genes with similar inter-gene expression patterns and with a number of the same known functions to be extracted automatically. By selecting a subtree of a function cluster and displaying it in detail, what gene functions are gathered there can be known, which facilitates the estimation of functions of genes having hitherto unknown functions. The invention also makes it possible for one to understand what patterns are specific to individual functions.

1. A method of displaying gene data, comprising steps of: displaying a dendrogram obtained by performing cluster analysis on a plurality of gene expression patterns; specifying a function of a gene to be cluster-extracted, and a condition for function cluster extraction; and highlighting in the dendrogram a gene function cluster which satisfies the condition in units of subtrees in the dendrogram, wherein the specified function is transport, glycolysis, or TCA cycle, and the condition for extracting function clusters comprises a minimum ratio of a number of genes having the specified function within the subtree to a total number of genes within the subtree, and a minimum number of genes contained in one function cluster that has the specified function.